Add TurboQuant KV cache compression (3-bit, 4.6x) #1067
arozanov wants to merge 4 commits into ml-explore:main
Conversation
Implements TurboQuant (arXiv 2504.19874) KV cache compression:

- PolarQuant: randomized Hadamard rotation + Lloyd-Max codebook
- Bit-packed uint32 storage (3-bit: 10 values per word)
- Fused Metal kernels for quantize and dequantize
- Incremental decode buffer for O(1) per-step cost
- Layer-adaptive mode: FP16 for first/last N layers

Usage: generate_step(prompt, model, turbo_kv_bits=3)

Results (Qwen2.5-32B, M4 Pro 48GB):

- 4.6x compression, 0.98x FP16 speed, identical quality
- 16K context: 4.2GB → 897MB KV cache
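The bit-packing claim above (10 values per uint32 word at 3-bit) can be sketched in NumPy. The PR does this in fused Metal kernels, so `pack_3bit`/`unpack_3bit` here are illustrative reference code, not the PR's API:

```python
import numpy as np

VALS_PER_WORD = 10  # 10 x 3 bits = 30 bits used; 2 bits of each uint32 are wasted

def pack_3bit(codes: np.ndarray) -> np.ndarray:
    """Pack an array of 3-bit codes (values 0..7) into uint32 words."""
    codes = codes.astype(np.uint32)
    pad = (-len(codes)) % VALS_PER_WORD
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint32)])
    codes = codes.reshape(-1, VALS_PER_WORD)
    words = np.zeros(len(codes), dtype=np.uint32)
    for i in range(VALS_PER_WORD):
        words |= codes[:, i] << np.uint32(3 * i)
    return words

def unpack_3bit(words: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_3bit: recover the first n 3-bit codes."""
    out = np.empty((len(words), VALS_PER_WORD), dtype=np.uint32)
    for i in range(VALS_PER_WORD):
        out[:, i] = (words >> np.uint32(3 * i)) & np.uint32(0x7)
    return out.reshape(-1)[:n]
```

Each uint32 word replaces 10 FP16 values (4 bytes instead of 20), which is where the ~5x raw storage reduction before metadata overhead comes from.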
I tried this branch on GLM-4.7-Flash-REAP-23B-A3B-mlx-nvfp4 and it outputs garbage; on the main branch it works fine.
That's unexpected - this branch shouldn't change default behavior, it only adds new files and optional parameters. Are you using the default generate() or did you pass turbo_kv_bits? If default, there might be a formatting issue from pre-commit that touched generate.py - I'll check.
This is the script I used: |
Ah got it, you are using turbo_kv_bits=4. That's expected to have quality issues on the K tensor - I've seen the same thing. Try turbo_kv_bits=3, which actually works better (counterintuitively, the 3-bit codebook fits the post-rotation Gaussian distribution better than 4-bit for K). Also, for a 23B model, try increasing turbo_fp16_layers to 2 or 4 to keep more layers in full precision.
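To illustrate the codebook point: a scalar Lloyd-Max quantizer trained on Gaussian samples, which is roughly what K values look like after the Hadamard rotation. This is a minimal sketch; the function names are illustrative, not the PR's API:

```python
import numpy as np

def lloyd_max_codebook(samples: np.ndarray, bits: int, iters: int = 30) -> np.ndarray:
    """Train a scalar Lloyd-Max codebook: alternate nearest-centroid
    assignment (decision boundaries at midpoints) with centroid re-estimation."""
    levels = 2 ** bits
    # initialize centroids at evenly spaced quantiles of the data
    centers = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (centers[:-1] + centers[1:]) / 2
        codes = np.searchsorted(edges, samples)
        for k in range(levels):
            bucket = samples[codes == k]
            if bucket.size:
                centers[k] = bucket.mean()
    return centers

def quantize(samples: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Map each sample to its nearest codebook centroid."""
    edges = (centers[:-1] + centers[1:]) / 2
    return centers[np.searchsorted(edges, samples)]

rng = np.random.default_rng(0)
x = rng.standard_normal(50_000)          # post-rotation values are roughly Gaussian
cb = lloyd_max_codebook(x, bits=3)       # 8-level codebook tuned to that distribution
mse = np.mean((quantize(x, cb) - x) ** 2)
```

For a unit Gaussian, the optimal 8-level Lloyd-Max quantizer achieves an MSE of about 0.035, which this training loop converges to.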
Found it. Your config turbo_kv_bits=4, turbo_fp16_layers=1 should work on most models, but MoE architectures like GLM-4.7-Flash might need more FP16 layers. Try turbo_fp16_layers=4 or turbo_fp16_layers=6. On my 7B tests, both 3-bit and 4-bit with fp16_layers=1 produce clean output. |
With the params you suggested, same garbage issue:
Yeah MLA is a different beast, makes sense it breaks. Good catch. Qwen3.5 should be fine since it's standard attention. Let me know how it goes. |
Did more testing:
Thanks for testing across architectures. MLA and SSM are expected - TurboQuant only works with standard multi-head attention KV cache. I should add a check that raises a clear error instead of silently producing garbage. Will fix. |
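A guard along these lines could fail fast. This is only a sketch of the promised fix: the attribute names and the set of rejected model types are assumptions, since the actual check isn't in the PR yet:

```python
from types import SimpleNamespace

def validate_turbo_kv_support(model, turbo_kv_bits):
    """Raise a clear error for cache types TurboQuant can't handle
    (MLA, SSM) instead of silently producing garbage.
    Hypothetical helper: the `model.args.model_type` attribute path and
    the tag list are illustrative, not mlx-lm's actual layout."""
    if turbo_kv_bits is None:
        return
    model_type = getattr(getattr(model, "args", SimpleNamespace()), "model_type", "")
    if any(tag in model_type.lower() for tag in ("mla", "mamba", "ssm")):
        raise ValueError(
            f"turbo_kv_bits requires a standard multi-head attention KV cache; "
            f"model_type {model_type!r} is unsupported"
        )
```

The key design point is raising at configuration time, before any tokens are generated, so the failure mode is an error message rather than degraded output.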
some thoughts for DX:
Good points, agree on all of them. Specifically:
Adds a --turbo-kv-bits flag (1-4) to compress stored prefix cache entries using TurboQuant (arXiv 2504.19874). 3-bit gives 4.6x compression vs FP16, compared to ~2x from the existing 8-bit quantization.

Integration points:

- memory_cache.py: _turbo_quantize_cache/_dequantize_cache, memory estimation, trim support, needs_dequantize property, config validation
- scheduler.py: turbo_kv_bits in SchedulerConfig, propagation to MemoryCacheConfig
- cli.py: --turbo-kv-bits for serve and bench commands

Requires mlx-lm with TurboQuant support (ml-explore/mlx-lm#1067).
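The config-validation and needs_dequantize pieces could look roughly like this. An illustrative subset: only the MemoryCacheConfig name, the turbo_kv_bits range, and the needs_dequantize property come from the summary, the rest is assumed:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemoryCacheConfig:
    """Illustrative subset of the cache config; field defaults are assumptions."""
    turbo_kv_bits: Optional[int] = None  # None = store prefix entries in FP16

    def __post_init__(self):
        # validate at construction so a bad CLI flag fails before serving starts
        if self.turbo_kv_bits is not None and not 1 <= self.turbo_kv_bits <= 4:
            raise ValueError(
                f"--turbo-kv-bits must be between 1 and 4, got {self.turbo_kv_bits}"
            )

    @property
    def needs_dequantize(self) -> bool:
        # compressed entries must be dequantized before a cached prefix is reused
        return self.turbo_kv_bits is not None
```

Validating in __post_init__ means both serve and bench reject an out-of-range flag with the same error, wherever the config is built.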
Force-pushed from a087778 to 9315fbc.
Summary
Adds TurboQuant KV cache compression (arXiv 2504.19874, ICLR 2026) as a new cache type for mlx-lm.
generate_step(prompt, model, turbo_kv_bits=3)

How it works
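The rotation step can be sketched in NumPy as an orthonormal fast Walsh-Hadamard transform with random sign flips. The PR implements this as a Metal kernel, so this is reference code only:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform over the last axis.
    The last-axis length must be a power of two; because the transform
    is orthonormal, it preserves norms and is its own inverse."""
    out = x.astype(np.float32)
    n = out.shape[-1]
    y = out.reshape(-1, n)  # view: butterfly updates below write into `out`
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[:, i:i + h].copy()
            b = y[:, i + h:i + 2 * h].copy()
            y[:, i:i + h] = a + b
            y[:, i + h:i + 2 * h] = a - b
        h *= 2
    return out / np.sqrt(n)

# randomized rotation: random sign flips, then the Hadamard transform
rng = np.random.default_rng(0)
signs = rng.choice(np.array([-1.0, 1.0], dtype=np.float32), size=128)
k = rng.standard_normal((4, 128)).astype(np.float32)
rotated = fwht(k * signs)  # same per-row norms; coordinates become near-Gaussian
```

The rotation spreads outlier energy across all coordinates, which is what lets a small Lloyd-Max codebook tuned to a Gaussian cover the rotated values well.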
Results
Qwen2.5-32B-Instruct-4bit (M4 Pro 48GB):
Context scaling (32B, 3-bit):
Qwen2.5-7B with layer-adaptive (1+1 FP16 layers):
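The headline ratio is easy to sanity-check: 3-bit packing stores 10 codes per uint32, i.e. 3.2 effective bits per value, so the raw ratio against FP16's 16 bits is 5x, and scale/codebook metadata brings it down to the reported ~4.6x. A back-of-envelope estimator; the ~8.7% metadata overhead is back-solved from the reported numbers rather than taken from the PR, and the Qwen2.5-32B dimensions are assumed:

```python
import math

def fp16_kv_bytes(tokens, layers, kv_heads, head_dim):
    # K and V tensors, 2 bytes per FP16 value
    return tokens * layers * kv_heads * head_dim * 2 * 2

def turbo_kv_bytes(tokens, layers, kv_heads, head_dim, bits=3, overhead=0.087):
    """overhead models per-group scales / codebook metadata; 8.7% is an
    assumption chosen to reproduce the reported 4.6x, not the PR's layout."""
    vals = tokens * layers * kv_heads * head_dim * 2
    codes_per_word = 32 // bits          # 3-bit: 10 codes per uint32
    packed = math.ceil(vals / codes_per_word) * 4
    return int(packed * (1 + overhead))

# assumed Qwen2.5-32B dims: 64 layers, 8 KV heads (GQA), head_dim 128
fp16 = fp16_kv_bytes(16_384, 64, 8, 128)   # ~4 GiB at 16K context
turbo = turbo_kv_bytes(16_384, 64, 8, 128)
```

With these dimensions the FP16 estimate lands near the reported 4.2GB for a 16K context, and the ratio comes out at ~4.6x.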
Usage
Files added
- mlx_lm/models/turboquant_cache.py — TurboQuantKVCache (compatible with _BaseCache)
- mlx_lm/models/turboquant_rotation.py — Walsh-Hadamard transform
- mlx_lm/models/turboquant_packing.py — Bit-packing utilities
- mlx_lm/models/turboquant_metal.py — Fused Metal quantize/dequantize kernels
- mlx_lm/models/turboquant_kernels.py — Parallel Metal dequant kernel
- mlx_lm/generate.py — Added turbo_kv_bits and turbo_fp16_layers parameters

Related
Test plan